39 research outputs found

    Ensemble Data Mining Methods

    Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, i.e., if they always agree, then the committee is unnecessary: any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.
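
The committee idea in this abstract can be illustrated with a minimal majority-vote sketch. It is not taken from the chapter itself: the toy labels and the three hypothetical members (`member_a`, `member_b`, `member_c`) are assumptions, chosen so that each member errs on a different example.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among the members' votes for one example."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(member_predictions):
    """Combine per-member prediction lists into one list by majority vote."""
    return [majority_vote(votes) for votes in zip(*member_predictions)]


def accuracy(predicted, truth):
    """Fraction of examples on which the prediction equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical toy data: each member is competent but wrong on a different example.
truth = [1, 0, 1, 1, 0, 1]
member_a = [0, 0, 1, 1, 0, 1]  # wrong on example 0
member_b = [1, 1, 1, 1, 0, 1]  # wrong on example 1
member_c = [1, 0, 0, 1, 0, 1]  # wrong on example 2

combined = ensemble_predict([member_a, member_b, member_c])
```

Because the members' errors do not overlap, the two correct members outvote the mistaken one on every example. If all three members made the same mistakes, the vote would add nothing over a single member.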

    Classification

    A supervised learning task involves constructing a mapping from input data (normally described by several features) to the appropriate outputs. Within supervised learning, one type of task is a classification learning task, in which each output is one or more classes to which the input belongs. In supervised learning, a set of training examples (examples with known output values) is used by a learning algorithm to generate a model. This model is intended to approximate the mapping between the inputs and outputs, and it can be used to generate predicted outputs for inputs that have not been seen before. For example, we may have data consisting of observations of sunspots. In a classification learning task, our goal may be to learn to classify sunspots into one of several types. Each example may correspond to one candidate sunspot with various measurements or just an image. A learning algorithm would use the supplied examples to generate a model that approximates the mapping between each supplied set of measurements and the type of sunspot. This model can then be used to classify previously unseen sunspots based on the candidate's measurements. This chapter discusses methods to perform machine learning, with examples involving astronomy.

    Ask-The-Expert: Minimizing Human Review for Big Data Analytics Through Active Learning

    In this CIF project, we worked toward semi-automating knowledge discovery from anomaly detection algorithms through the use of active learning. Active learning is an area of research within machine learning that uses an "expert in the loop" to learn from large data sets that have very few annotations or labels available, and where providing such labels is expensive. In our case, the task can be defined as the identification of safety events from flight operational data. Since traditional anomaly detection algorithms cannot differentiate between operationally relevant and irrelevant statistical anomalies, Subject Matter Experts (SMEs) have a lengthy and expensive burden of investigating every example identified by the detection algorithm, classifying and labeling them as relevant or irrelevant. Active learning identifies the unlabeled example for which a label would most improve the classifier, asks the domain expert for a label, and repeats this process until there are no more resources (time, budget) available for labeling or a minimum required performance is reached. A positive label indicates an operationally significant safety event whereas a negative label indicates otherwise. Based on these few labels we propose to build an active learning system that utilizes the SME's time in the most effective manner by iteratively asking for labels for as few informative instances as possible. Our work was proposed to be a stepping stone toward implementation and deployment of the system with a user interface, to be pursued by the Aviation Operations and Safety Program (AOSP) given its interest in safety monitoring and discovery of safety incidents.
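
The query loop described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the project's system: uncertainty sampling (query the example whose predicted probability is closest to 0.5) stands in for "the example for which a label would most improve the classifier", the one-dimensional threshold model `fit_threshold` is a hypothetical toy classifier over features in [0, 1], and `oracle` plays the role of the SME. Only the labeling-budget stopping rule is implemented.

```python
import math


def fit_threshold(labeled):
    """Toy 1-D classifier: place the threshold midway between the largest
    labeled negative and the smallest labeled positive (defaults 0.0 and 1.0)."""
    negatives = [x for x, y in labeled.items() if y == 0]
    positives = [x for x, y in labeled.items() if y == 1]
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    threshold = (lo + hi) / 2
    # Return a function giving the predicted probability of the positive class.
    return lambda x: 1 / (1 + math.exp(-10 * (x - threshold)))


def most_uncertain(pool, predict_proba):
    """Uncertainty sampling: pick the example with probability closest to 0.5."""
    return min(pool, key=lambda x: abs(predict_proba(x) - 0.5))


def active_learning_loop(pool, oracle, fit, budget):
    """Query the oracle for one label at a time until the budget is spent."""
    pool = list(pool)
    labeled = {}
    predict_proba = fit(labeled)
    while pool and len(labeled) < budget:
        x = most_uncertain(pool, predict_proba)
        pool.remove(x)
        labeled[x] = oracle(x)        # ask the expert for a label
        predict_proba = fit(labeled)  # retrain on all labels so far
    return labeled, predict_proba


# Hypothetical data: the "expert" marks an example relevant when x > 0.6.
pool = [0.05, 0.2, 0.35, 0.5, 0.62, 0.7, 0.9]
labeled, model = active_learning_loop(pool, lambda x: int(x > 0.6), fit_threshold, budget=3)
```

With a budget of three labels, the loop spends all of them near the decision boundary and never asks the expert about the clear-cut examples at either end of the range.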

    Multiple Kernel Learning for Heterogeneous Anomaly Detection: Algorithm and Aviation Safety Case Study

    The world-wide aviation system is one of the most complex dynamical systems ever developed and is generating data at an extremely rapid rate. Most modern commercial aircraft record several hundred flight parameters including information from the guidance, navigation, and control systems, the avionics and propulsion systems, and the pilot inputs into the aircraft. These parameters may be continuous measurements or binary or categorical measurements recorded in one-second intervals for the duration of the flight. Currently, most approaches to aviation safety are reactive, meaning that they are designed to react to an aviation safety incident or accident. In this paper, we discuss a novel approach based on the theory of multiple kernel learning to detect potential safety anomalies in very large databases of discrete and continuous data from world-wide operations of commercial fleets. We pose a general anomaly detection problem which includes both discrete and continuous data streams, where we assume that the discrete streams have a causal influence on the continuous streams. We also assume that atypical sequences of events in the discrete streams can lead to off-nominal system performance. We discuss the application domain, novel algorithms, and results on real-world data sets. Our algorithm uncovers operationally significant events in high-dimensional data streams in the aviation industry which are not detectable using state-of-the-art methods.
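
One generic way to handle discrete and continuous streams together in a kernel method is a convex combination of one kernel per data type. The sketch below shows only that idea. The specific kernels (position-wise agreement for symbol sequences, an RBF kernel for continuous values), the weight `eta`, and the field names `switches` and `sensors` are assumptions for illustration and are not taken from the paper.

```python
import math


def discrete_kernel(a, b):
    """Fraction of positions at which two equal-length symbol sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.exp(-gamma * squared_distance)


def combined_kernel(f1, f2, eta=0.5):
    """Convex combination of the discrete and continuous kernels (0 <= eta <= 1)."""
    return (eta * discrete_kernel(f1["switches"], f2["switches"])
            + (1 - eta) * rbf_kernel(f1["sensors"], f2["sensors"]))


# Hypothetical flights: a discrete switch sequence plus continuous sensor values.
flight_a = {"switches": "AABB", "sensors": [0.1, 0.2]}
flight_b = {"switches": "AABA", "sensors": [0.1, 0.2]}

similarity = combined_kernel(flight_a, flight_b)
```

A convex combination of valid kernels is itself a valid kernel, so the combined similarity can be passed to any kernel-based detector. The weight `eta` controls how much the discrete stream contributes relative to the continuous one.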

    nu-Anomica: A Fast Support Vector Based Novelty Detection Technique

    In this paper we propose nu-Anomica, a novel anomaly detection technique that can be trained on huge data sets with much reduced running time compared to the benchmark one-class Support Vector Machines algorithm. In nu-Anomica, the idea is to train the machine such that it can provide a close approximation to the exact decision plane using fewer training points and without losing much of the generalization performance of the classical approach. We have tested the proposed algorithm on a variety of continuous data sets under different conditions. We show that under all test conditions the developed procedure closely preserves the accuracy of standard one-class Support Vector Machines while reducing both the training time and the test time by a factor of 5 to 20.

    Fast and Flexible Multivariate Time Series Subsequence Search

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which often contain several gigabytes of data. Surprisingly, research on MTS search is very limited. Most of the existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two algorithms to solve this problem: (1) a List Based Search (LBS) algorithm which uses sorted lists for indexing, and (2) an R*-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences. Both algorithms guarantee that all matching patterns within the specified thresholds will be returned (no false dismissals). The very few false alarms can be removed by a post-processing step. Since our framework is also capable of Univariate Time-Series (UTS) subsequence search, we first demonstrate the efficiency of our algorithms on several UTS datasets previously used in the literature. We follow this up with experiments using two large MTS databases from the aviation domain, each containing several million observations. Both these tests show that our algorithms have very high prune rates (>99%), thus requiring actual disk access for less than 1% of the observations. To the best of our knowledge, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
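
To make the query model concrete, here is a naive exhaustive-scan baseline for a query over a subset of variables with per-variable time delays. It is not the LBS or RBS algorithm from the paper; the indexes described above exist precisely to avoid scanning every offset like this. The point-wise absolute-difference threshold, the variable names, and the toy data are all assumptions.

```python
def matches(series, pattern, start, threshold):
    """True if series[start:start+len(pattern)] lies within `threshold`
    of `pattern` at every point; False if the window is out of range."""
    if start < 0 or start + len(pattern) > len(series):
        return False
    return all(abs(s - p) <= threshold for s, p in zip(series[start:], pattern))


def search(database, query, threshold):
    """Return every offset t at which all queried variables match.

    database: dict mapping variable name -> list of values (equal lengths).
    query:    dict mapping variable name -> (delay, pattern); the pattern for
              that variable must match starting at offset t + delay.
    """
    length = len(next(iter(database.values())))
    hits = []
    for t in range(length):
        if all(matches(database[name], pattern, t + delay, threshold)
               for name, (delay, pattern) in query.items()):
            hits.append(t)
    return hits


# Hypothetical three-variable recording; the query uses only two of the variables.
database = {
    "altitude": [0, 1, 2, 3, 2, 1, 0, 1, 2, 3],
    "speed":    [5, 5, 6, 7, 8, 7, 6, 5, 6, 7],
    "flaps":    [0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
}
# A climb 1, 2, 3 in altitude, with flaps equal to 1 two samples after the climb starts.
query = {"altitude": (0, [1, 2, 3]), "flaps": (2, [1])}

hits = search(database, query, threshold=0.5)
```

The scan touches every offset of every queried variable, which is what makes it impractical on databases with millions of observations and motivates index-based pruning.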

    Fast Multivariate Search on Large Aviation Datasets

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem: (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several million observations. Both these tests show that our algorithms have very high prune rates (>95%) thus needing actua

    Anomaly Detection Techniques with Real Test Data from a Spinning Turbine Engine-Like Rotor

    Online detection techniques to monitor the health of rotating engine components are becoming increasingly attractive to aircraft engine manufacturers in order to increase safety of operation and lower maintenance costs. Health monitoring remains challenging to implement, especially in the presence of scattered loading conditions, crack size, component geometry, and material properties. The current trend, however, is to utilize noninvasive types of health monitoring or nondestructive techniques to detect hidden flaws and mini-cracks before any catastrophic event occurs. These techniques go further to evaluate material discontinuities and other anomalies that have grown to the level of critical defects that can lead to failure. Generally, health monitoring is highly dependent on sensor systems capable of performing in various engine environmental conditions and able to transmit a signal upon a predetermined crack length, while acting in a neutral form upon the overall performance of the engine system.

    Toward Justifiable Trust in Autonomous Systems Incorporating Human Knowledge in Autonomous Systems through Machine Learning

    Trust in Autonomous Systems is largely about humans trusting the decisions made by autonomous systems. This trust can be increased through learning from domain experts. In particular, autonomous systems can learn offline from past mission operations before conducting any operations of their own. Additionally, autonomous systems can learn online by obtaining human feedback during operations. We will discuss several classes of machine learning methods and our application of them to autonomous systems. The first class of methods is anomaly detection, which uses operations data to identify examples of anomalous operations. The second class of methods is inverse reinforcement learning, also known as apprenticeship learning, which takes past operations data as input and yields a controller that is able to duplicate the operations described by the data. The third class is active learning, which identifies examples on which the model is most uncertain and requests domain expert feedback.

    Active Learning with Rationales for Identifying Operationally Significant Anomalies in Aviation

    A major focus of the commercial aviation community is discovery of unknown safety events in flight operations data. Data-driven unsupervised anomaly detection methods are better at capturing unknown safety events than rule-based methods, which only look for known violations. However, not all statistical anomalies that are discovered by these unsupervised anomaly detection methods are operationally significant (e.g., represent a safety concern). Subject Matter Experts (SMEs) have to spend significant time reviewing these statistical anomalies individually to identify a few operationally significant ones. In this paper we propose an active learning algorithm that incorporates SME feedback in the form of rationales to build a classifier that can distinguish between uninteresting and operationally significant anomalies. Experimental evaluation on real aviation data shows that our approach improves detection of operationally significant events by as much as 75% compared to the state-of-the-art. The learnt classifier also generalizes well to additional validation data sets.